SVM Classification for High-dimensional Imbalanced Data based on SNR and Under-sampling
نویسنده
چکیده
Support vector machine (SVM) is biased towards the majority class, in some case dataset is class-imbalanced and the bias is even larger for high-dimensional. In order to improve the classification accuracy of SVM on high-dimensional imbalanced data, we combine signal-noise ratio (SNR) and under-sampling technique based on K-means. In this article firstly we apply SNR into feature selection to reducing the feature amount then solve the problem of data imbalance using under-sampling technique based on K-means. To verify the feasibility of the proposed strategy, we utilize some metrics such as receiver operating characteristic curve (ROC curve) and area under the receiver operating characteristic curve (AUC value).As a result, the AUC value increased by 4%~16% before and after the process. The experimental results show that our strategy is feasible and effective exactly.
منابع مشابه
Improving Imbalanced data classification accuracy by using Fuzzy Similarity Measure and subtractive clustering
Classification is an one of the important parts of data mining and knowledge discovery. In most cases, the data that is utilized to used to training the clusters is not well distributed. This inappropriate distribution occurs when one class has a large number of samples but while the number of other class samples is naturally inherently low. In general, the methods of solving this kind of prob...
متن کاملA Selective Sampling Method for Imbalanced Data Learning on Support Vector Machines
The class imbalance problem in classification has been recognized as a significant research problem in recent years and a number of methods have been introduced to improve classification results. Rebalancing class distributions (such as over-sampling or under-sampling of learning datasets) has been popular due to its ease of implementation and relatively good performance. For the Support Vector...
متن کاملFlexible high-dimensional classification machines and their asymptotic properties
Classification is an important topic in statistics and machine learning with great potential in many real applications. In this paper, we investigate two popular large-margin classification methods, Support Vector Machine (SVM) and Distance Weighted Discrimination (DWD), under two contexts: the high-dimensional, low-sample size data and the imbalanced data. A unified family of classification ma...
متن کاملSupport Vector Machines for Class Imbalance Rail Data Classification with Bootstrapping-Based Over-Sampling and Under-Sampling
Support Vector Machines (SVMs) is a popular machine learning technique, which has proven to be very effective in solving many classical problems with balanced data sets in various application areas. However, this technique is also said to perform poorly when it is applied to the problem of learning from heavily imbalanced data sets where the majority classes significantly outnumber the minority...
متن کاملSample Subset Optimization for Classifying Imbalanced Biological Data
Data in many biological problems are often compounded by imbalanced class distribution. That is, the positive examples may largely outnumbered by the negative examples. Many classification algorithms such as support vector machine (SVM) are sensitive to data with imbalanced class distribution, and result in a suboptimal classification. It is desirable to compensate the imbalance effect in model...
متن کامل